GETracker3: A Robust, Lightweight Topic Tracking System

Authors

  • Tomek Strzalkowski
  • G. Bowden Wise
  • Amit Bagga
  • Gees C. Stein
Abstract

We describe a topic tracking system developed at GE Corporate R&D Center in connection with our participation in DARPA TDT3 evaluations. The TDT tracking task is specified as follows: given Nt training news stories on a topic, the system must find all subsequent stories on the same topic in all tracked news sources. These sources include radio and television news broadcasts, as well as newswire feeds. The initial set of training stories (usually 1, 2 or 4) is the only information about the topic available to the tracking system. Tracking performance is gauged using the False Alarm Rate and Miss Rate metrics, which reflect the incidence of incorrect classification decisions made by the automatic system.

1. OVERALL DESIGN

The GETracker operates by first creating a topic tracking query (TQ) out of the Nt available training stories. During tracking, each incoming story is processed and assessed for relevance with respect to the tracking query. Stories which exceed the empirically established threshold are classified as being "on topic".

For TDT2 we developed a tracker using lightweight, extremely portable and robust algorithms that rely on content compression rather than on corpus statistics to detect relevance and topicality of source material. The tracker made use of our single-document summarizer to compress the content of each incoming story. Stories whose summaries cleared the empirically established threshold were classified as being "on topic".

Although we achieved reasonably high accuracy using a summarization-based approach in TDT2, we decided to change our strategy to a more traditional statistical (IR) approach for TDT3 in an attempt to improve performance. Our experiments and results for TDT2 indicated that the use of co-locations helps performance during tracking. For TDT3 we felt that using co-locations along with a suitable combination of weights would allow us to achieve better performance.
We therefore explored how best to combine the following strategies:

  • Use of co-locations.
  • Impact of different weighting (tf*idf) schemes.
  • Effect of pre-training corpora.

For TDT3 our general approach is to form a tracking query using the Nt training stories. During tracking, a query vector and a document vector are derived for each document tracked. A cosine similarity metric is then used to determine whether a document is on topic.

2. BUILDING THE TOPIC TRACKING QUERY

The topic tracking query (TQ) is built out of available training material, which consists of Nt news stories on a topic of interest. In TDT3 evaluations, the default value of Nt has been 4 topical stories. The initial tracking query TQ0 is formed out of the most frequent non-stop words and co-locations in the training set. The size of the tracking query (number of terms) is chosen based on the total number of non-stop words found across all Nt training stories. If a term frequency is larger than a preset threshold $h$, that term is added to the tracking query:

\[ h = \lfloor \log(D) \rfloor \cdot F \]

where
  D = total number of non-stop words (including repetitions)
  F = 1

In addition, co-locations are all pairs of TQ0 terms that occur together in 2 or more training stories. At this time, all within-the-story co-occurrences are collected. More advanced proximity or codependency calculations are planned for the future.

During TDT2, we also experimented with adaptive tracking query updating, periodically updating the tracking query terms as tracking progressed. However, we found that this approach did not help performance, so we did not consider it for TDT3.

3. PRE-TRAINING AND DYNAMIC IDF

There are a few (around 100 or fewer) stories between the last of the Nt training stories and the first tracking story. Pre-training is done by processing these stories for within-document frequencies.
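The query construction described in Section 2 can be sketched as follows. This is a minimal illustration, not the GETracker code: the function name `build_tracking_query`, the tiny `STOP_WORDS` set, the use of the natural logarithm, and the reading of the threshold as floor(log(D)) multiplied by F are all assumptions made for the example. Input stories are assumed to be already tokenized and stemmed.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical stop list; the paper does not say which list was used.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def build_tracking_query(stories, f=1.0):
    """Return (terms, colocations) for a list of tokenized training stories.

    Assumes h = floor(ln(D)) * f; the log base and the role of F are guesses.
    """
    # Drop stop words from every training story.
    docs = [[t for t in story if t not in STOP_WORDS] for story in stories]

    # D: total number of non-stop words, repetitions included.
    counts = Counter(t for doc in docs for t in doc)
    total = sum(counts.values())
    h = math.floor(math.log(total)) * f if total > 0 else 0

    # A term joins the query when its frequency is strictly larger than h.
    terms = {t for t, c in counts.items() if c > h}

    # Co-locations: pairs of query terms that co-occur in 2 or more stories.
    pair_df = Counter()
    for doc in docs:
        present = sorted(terms.intersection(doc))
        pair_df.update(combinations(present, 2))
    colocations = {pair: n for pair, n in pair_df.items() if n >= 2}
    return terms, colocations
```

Each co-location is stored as an alphabetically ordered pair together with the number of training stories it appeared in, since that count is what the co-location premium later needs.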
In addition, these documents are merged to initialize overall statistics before tracking begins. These statistics are updated as stories are tracked so that idf values may be computed dynamically. We use the following formula for computing idf for each stem:

\[ \mathrm{idf}(\mathit{stem}) = \log\left(\frac{\mathrm{NDOCS}}{\mathrm{df}(\mathit{stem})}\right) \]

where
  NDOCS = total number of docs seen
  df(stem) = number of docs containing stem

Once the pre-training stories have been processed, df values will be initialized for all non-stop words seen and NDOCS will equal the number of pre-training documents seen. Once tracking begins, NDOCS and df are dynamically updated as each document is tracked.

4. TRACKING

During tracking, within-document statistics are computed for each story and then merged with overall statistics so that idf values may be computed dynamically as explained above. Once statistics have been updated, two vectors are formed:

  a query vector: $QW = \{wq_i\}$
  a document vector: $DW = \{wd_i\}$

The cosine similarity between these two vectors is then used to ascertain whether the document is on or off topic:

\[ \mathrm{SIM}(DW, QW) = \frac{\sum_i (wd_i \cdot wq_i)}{|DW| \cdot |QW|} \]

where

\[ |DW| = \sqrt{wd_1^2 + \dots + wd_n^2} \quad \text{and} \quad |QW| = \sqrt{wq_1^2 + \dots + wq_n^2} \]

When co-locations were found within a document, a premium was added to the cosine score:

\[ \mathit{score} = \mathit{simScore} + \mathit{colocScore} \]

When the score exceeds the threshold (explained below) the document is considered to be on topic. We explored two different weighting schemes for the query and document vectors.

4.1. Weighting Scheme 1

For the first scheme, both the document and query vectors contain an idf factor. A document vector is formed by computing weights for each stem, $s_i$, found in the document:

\[ DW = \{ wd_i = \mathrm{tf}_d(s_i) \cdot \mathrm{idf}(s_i) \} \]

A query vector is also formed for those stems $s_i$ in the document that are also in the tracking query:

\[ QW = \{ wq_i = \mathrm{tf}_q(s_i) \cdot \mathrm{idf}(s_i) \} \]

where
  tf_d(s_i) = frequency of stem s_i across all documents seen
  tf_q(s_i) = frequency of stem s_i across all Nt training docs
  idf(s_i) = log(NDOCS / df(s_i))

4.2. Weighting Scheme 2

After examining the literature and other strategies used by TDT participants, we thought that having an idf factor in both the query and document vectors may not be desirable. We therefore modified our weighting scheme so that only query weights included an idf factor. The document weights do not include an idf factor, but are normalized. A document vector is formed by computing weights for each stem, $s_i$, found in the document:

\[ DW = \{ wd_i = \log(\mathrm{tf}_d(s_i)) \} \]

A query vector is also formed for those stems $s_i$ in the document that are also in the tracking query:

\[ QW = \{ wq_i = \mathrm{tf}_q(s_i) \cdot \mathrm{idf}(s_i) \} \]

where
  tf_d(s_i) = frequency of stem s_i across all documents seen
  tf_q(s_i) = frequency of stem s_i across all Nt training docs
  idf(s_i) = log(NDOCS / df(s_i))

However, since we did not see any significant gain when using this scheme, we decided to use weighting scheme 1 for our experiments.

5. USE OF ADDITIONAL CORPORA

We experimented with the use of an additional pre-training corpus using 2 months of data from the TDT3 corpus. By pre-computing frequency statistics from a corpus, tracking could begin with idf values already initialized and stabilized. When an additional corpus is used, the following associated statistics are computed:

  NDOCS_corpus = total number of docs seen in corpus
  df_corpus(stem) = number of docs in corpus containing stem

and idf is calculated as follows:

\[ \mathrm{idf}(\mathit{stem}) = \log\left(\frac{\mathrm{NDOCS} + \mathrm{NDOCS}_{corpus}}{\mathrm{df}(\mathit{stem}) + \mathrm{df}_{corpus}(\mathit{stem})}\right) \]

When no additional corpus is used, NDOCS_corpus and df_corpus are zero. We also experimented with the use of separate pre-training corpora for English and Mandarin stories. We performed experiments on only a subset of the tracking runs, but our results indicated that pre-training did not help performance very much.

6. CO-LOCATIONS

The query vector QW contains weights for each tracking query term found in a document.
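As a concrete reading of Sections 3 to 5, the sketch below keeps running document-frequency statistics, optionally seeded from an additional corpus, and scores a story with weighting scheme 1. The names `DynamicIdf` and `cosine_score` are hypothetical, the natural logarithm is assumed, returning an idf of 0 for an unseen stem is an assumed convention, and the document-side tf is taken here as the within-document count, which is also an assumption.

```python
import math
from collections import Counter


class DynamicIdf:
    """Running df statistics, optionally seeded from a pre-training corpus."""

    def __init__(self, ndocs_corpus=0, df_corpus=None):
        self.ndocs_corpus = ndocs_corpus          # NDOCS_corpus (0 if unused)
        self.df_corpus = Counter(df_corpus or {})  # df_corpus (empty if unused)
        self.ndocs = 0                            # NDOCS
        self.df = Counter()                       # df(stem)

    def update(self, stems):
        """Fold one document (an iterable of stems) into the statistics."""
        self.ndocs += 1
        self.df.update(set(stems))  # count each stem once per document

    def idf(self, stem):
        df = self.df[stem] + self.df_corpus[stem]
        if df == 0:
            return 0.0  # unseen stem: assumed convention, not from the paper
        return math.log((self.ndocs + self.ndocs_corpus) / df)


def cosine_score(doc_stems, query_tf, stats):
    """Weighting scheme 1: tf * idf on both the document and the query side."""
    doc_tf = Counter(doc_stems)  # within-document counts (an assumption)
    dw = {s: tf * stats.idf(s) for s, tf in doc_tf.items()}
    # Query weights only for stems that occur in this document.
    qw = {s: query_tf[s] * stats.idf(s) for s in doc_tf if s in query_tf}
    dot = sum(dw[s] * w for s, w in qw.items())
    norm_d = math.sqrt(sum(w * w for w in dw.values()))
    norm_q = math.sqrt(sum(w * w for w in qw.values()))
    norm = norm_d * norm_q
    return dot / norm if norm > 0 else 0.0
```

In use, `stats.update(story)` would be called on each pre-training story and on each tracked story before it is scored, so that NDOCS and df grow as tracking proceeds.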
When a co-location pair is also found within a document, the weights in QW are used to create an additional premium. Let $(s_i, s_j)$ be a co-location pair:

\[ \mathrm{premium}(s_i, s_j) = \frac{F \cdot wq_i \cdot wq_j}{2 \cdot C} \]

where
  C = 1
  F = df(s_i, s_j) / Nt
  df(s_i, s_j) = number of training docs the pair appeared in

A co-location score is found by computing the average premium score over the pairs found within a document:

\[ \mathit{colocScore} = \frac{\sum \mathrm{premium}(s_i, s_j)}{\mathit{numPairs}} \cdot \mathrm{MAX\_PREMIUM} \]

where MAX_PREMIUM = 0.1.

We have found that the use of co-locations helps the performance of topic tracking.

7. THRESHOLDING

For each document tracked, the cosine similarity score of the query and document vectors, with the addition of any co-location premium, is the score for the document. This score must meet or exceed a threshold for the document to be "on topic". We used the following adaptive threshold based on the size $|Q|$ of the tracking query:

\[
\mathit{threshold} =
\begin{cases}
3.0 \cdot |Q| / 100.0 & \text{if } |Q| \le 6 \\
2.0 \cdot |Q| / 100.0 & \text{if } |Q| \le 15 \\
1.2 \cdot |Q| / 100.0 & \text{if } |Q| \le 25 \\
1.0 \cdot |Q| / 100.0 & \text{if } |Q| \le 35 \\
0.6 \cdot |Q| / 100.0 & \text{if } |Q| > 35
\end{cases}
\]

8. EXAMPLES

We illustrate the inner workings of our tracker using some examples from our TDT3 final evaluation run.

8.1. Topic 3007: Congo Rebellion

The Nt = 4 training stories produce the following tracking query (TQ) consisting of 30 terms:

rebel, congo, kindu, tutsi, kabila, eastern, town, kilomet, mile, govern, soldier, base, congoles, troop, air, command, fighter, kinshasa, rebellion, rwandan, control, fight, forc, goma, kalima, mondo, month, refuge, took, war

Examining our tracking output file for this topic, we identify the following YES response for a story: as1/19981231 200
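The co-location premium of Section 6 and the adaptive threshold of Section 7 can be sketched together as below. The function names are hypothetical, and two readings are assumptions: that the average premium is multiplied by MAX_PREMIUM, and that each threshold case is a coefficient times |Q| divided by 100. The cases are checked in order, so each upper bound applies only to query sizes not caught by an earlier case.

```python
def coloc_score(pairs_in_doc, qw, pair_df, nt, c=1.0, max_premium=0.1):
    """Average co-location premium over the pairs found in a document.

    pairs_in_doc: list of (s_i, s_j) query pairs present in the document.
    qw: query weights keyed by stem.
    pair_df: number of training stories each pair appeared in.
    nt: number of training stories (Nt).
    """
    if not pairs_in_doc:
        return 0.0
    total = 0.0
    for si, sj in pairs_in_doc:
        f = pair_df[(si, sj)] / nt             # F = df(s_i, s_j) / Nt
        total += f * qw[si] * qw[sj] / (2 * c)  # premium(s_i, s_j)
    # Scaling the average by max_premium is an assumed reading of the formula.
    return total / len(pairs_in_doc) * max_premium


def adaptive_threshold(query_size):
    """Threshold as a function of the tracking query size |Q|."""
    for limit, factor in ((6, 3.0), (15, 2.0), (25, 1.2), (35, 1.0)):
        if query_size <= limit:
            return factor * query_size / 100.0
    return 0.6 * query_size / 100.0


def is_on_topic(sim_score, coloc, query_size):
    """A story is on topic when its score meets or exceeds the threshold."""
    return sim_score + coloc >= adaptive_threshold(query_size)
```

Under this reading, a 30-term query such as the one in Section 8.1 would fall in the fourth case, giving a threshold of 1.0 * 30 / 100.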


Publication date: 1999